This report conducts an equity analysis of air quality in San Mateo County. We use PurpleAir datasets to get the indoor and outdoor air quality data.
The report takes two cities (Menlo Park and Redwood City) as examples and analyzes geographic equity by mapping the Air Quality Index (AQI) of each block group and plotting the average PM2.5 in February 2022. Population equity is analyzed by comparing the PM2.5 distribution across income groups and racial groups. Finally, in the data equity part, the report offers suggestions to suppliers of PM2.5 sensors by proposing two methodologies to assess which block groups have a greater need for PM2.5 sensors.
Raw sensor data is extracted from PurpleAir and then converted into PM2.5 and AQI values. There are 1055 sensors in San Mateo County in total. The following map shows the relative AQI of each sensor in the county. To better distinguish between good and bad air quality across areas, we show “relative values” on the map, which makes it easier to see which areas are in the top 20% of air quality and which are in the bottom 20%. From the map we can see that the AQI near East Palo Alto and Redwood City is not as good as elsewhere. In absolute terms, however, the AQI of most block groups in the county is Good.
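The conversion from a PM2.5 concentration to an AQI value can be sketched as follows. This is a minimal illustration, not the code used for the report: it assumes the US EPA 24-hour PM2.5 breakpoints that were in effect in 2022, rounds the concentration to one decimal as a simplification, and the function names `pm25_to_aqi` and `aqi_category` are hypothetical.

```python
# US EPA PM2.5 breakpoints (24-hour, ug/m3) in effect in 2022.
# Using this table is an assumption; the report does not state its conversion.
BREAKPOINTS = [
    # (conc_low, conc_high, aqi_low, aqi_high)
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(pm25):
    """Map a PM2.5 concentration to an AQI value by piecewise-linear interpolation."""
    conc = round(pm25, 1)  # simplification: round to one decimal place
    for c_lo, c_hi, i_lo, i_hi in BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    return None  # negative or beyond the top of the scale


def aqi_category(aqi):
    """Return only the two lowest category names; higher values are lumped together."""
    if aqi is None:
        return "NA"
    if aqi <= 50:
        return "Good"
    if aqi <= 100:
        return "Moderate"
    return "Unhealthy for Sensitive Groups or worse"
```

For example, `aqi_category(pm25_to_aqi(8.0))` returns `"Good"`, which matches the observation that most block groups fall in the Good range.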
We use the Voronoi technique to transform point estimates of outdoor air quality to census block groups. The following map shows the results after Voronoi splitting. In each sub-area, if there are two or more sensors, we take the average of their values as the AQI value of the area; for areas with only one sensor, we take the value of that sensor; and for areas without sensors, we take NA.
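The aggregation rule described above (average of several sensors, the single value for one sensor, NA for none) can be sketched as follows. This is a minimal illustration that leaves out the Voronoi geometry itself; the function name `area_values` and the sample identifiers are hypothetical, and `None` stands in for NA.

```python
from statistics import mean


def area_values(area_ids, readings):
    """Aggregate sensor readings per area.

    area_ids: every area that should appear in the result.
    readings: list of (area_id, value) pairs, one per sensor.
    Returns {area_id: mean of its sensors, or None when it has no sensor}.
    """
    grouped = {}
    for area_id, value in readings:
        grouped.setdefault(area_id, []).append(value)
    # mean() of a single value is that value, so the one-sensor case needs no branch.
    return {a: (mean(grouped[a]) if a in grouped else None) for a in area_ids}


# Hypothetical example: area A has two sensors, B has one, C has none.
example = area_values(["A", "B", "C"], [("A", 10.0), ("A", 20.0), ("B", 7.0)])
```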
The map below shows the result after Voronoi interpolation at the block group level. From the map we can see that places near the bay tend to have higher PM2.5. This may be due to urban zoning, since residential areas have relatively better air quality.
This chart shows the outdoor PM2.5 level in Menlo Park in February 2022. PM2.5 in Menlo Park fluctuates during the month: outdoor values rise on weekends and fall on weekdays. In addition, in the second half of February the average PM2.5 was relatively lower and fluctuated less.
Similarly, the following map shows the results after Voronoi splitting in Redwood City. In the area near the bay there are very few sensors, so we can only take the PM2.5 value of a sensor in a relatively nearby area as the PM2.5 value of that area, which may not be accurate.
Next, we show the result after Voronoi interpolation at the block group level. Judging by PM2.5 alone, Redwood City’s air quality seems a little worse than Menlo Park’s. In addition, PM2.5 is higher in the west and in the center than elsewhere. Air quality estimates near the bay are likely to be inaccurate because of the insufficient number of sensors, but residential areas in Redwood City also show relatively high PM2.5 values.
Here we can see the outdoor PM2.5 level in Redwood City in February 2022. As in Menlo Park, PM2.5 fluctuates during the month. In general, PM2.5 tends to be relatively high on weekends and low on weekdays in both cities, which may be because more people go out on weekends.
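The weekend versus weekday comparison can be sketched as follows. This is a minimal illustration rather than the analysis behind the charts: the daily values in `sample` are hypothetical, and the function name `weekend_weekday_means` is an assumption.

```python
from datetime import date
from statistics import mean


def weekend_weekday_means(daily):
    """Split daily PM2.5 values into weekend and weekday groups and average each.

    daily: {datetime.date: pm25}. Returns (weekend_mean, weekday_mean).
    """
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    weekend = [v for d, v in daily.items() if d.weekday() >= 5]
    weekday = [v for d, v in daily.items() if d.weekday() < 5]
    return mean(weekend), mean(weekday)


# Hypothetical daily averages for four days in February 2022.
sample = {
    date(2022, 2, 4): 6.0,   # Friday
    date(2022, 2, 5): 12.0,  # Saturday
    date(2022, 2, 6): 14.0,  # Sunday
    date(2022, 2, 7): 8.0,   # Monday
}
weekend_mean, weekday_mean = weekend_weekday_means(sample)
```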
Overall, the outdoor air quality of Redwood City and Menlo Park is very similar in February: the peak PM2.5 value is between 15 and 16 and the lowest value is about 4 to 5. In terms of indoor air quality, however, Redwood City is relatively poor, which could be due to factors such as outdated building ventilation systems. See dashboard for PM2.5: https://yaojinghuanghe.shinyapps.io/dashboard_pm25/
Please check this dashboard for equity analysis: https://yaojinghuanghe.shinyapps.io/dashboard_pm25_equity/
We collect income data for San Mateo County from the ACS 5-year dataset (2019) at the block group level and divide income into four levels: Less than $24,999 (low), $25,000 to $44,999 (median low), $45,000 to $99,999 (median high), and $100,000 or more (high). We divide PM2.5 into 5 levels.
From the following equity analysis figure we can see that PM2.5 exposure is unequal across income groups. High-income groups are less exposed to bad air quality (in terms of PM2.5) than they ‘should’ be (based on their share of the population), whereas low-income and median-low-income groups are more exposed to bad air quality than they ‘should’ be.
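One way to make the ‘should’ comparison concrete is to divide each group's share of the population in a given PM2.5 level by that group's share of the total population. The sketch below is an illustration under that assumption, not the computation behind the figure; the function name `representation_ratio` and all population counts are hypothetical.

```python
def representation_ratio(pop_by_group_level, level):
    """Ratio of a group's share within one PM2.5 level to its overall population share.

    pop_by_group_level: {group: {level: population}}.
    A ratio above 1 means the group is over-represented in that level;
    a ratio below 1 means it is under-represented.
    """
    group_totals = {g: sum(levels.values()) for g, levels in pop_by_group_level.items()}
    total = sum(group_totals.values())
    level_total = sum(levels.get(level, 0) for levels in pop_by_group_level.values())
    return {
        g: (levels.get(level, 0) / level_total) / (group_totals[g] / total)
        for g, levels in pop_by_group_level.items()
    }


# Hypothetical counts: "worst" is the highest PM2.5 level, "other" is everything else.
counts = {
    "low income": {"worst": 300, "other": 700},
    "high income": {"worst": 200, "other": 1800},
}
ratios = representation_ratio(counts, "worst")
```

With these made-up counts the low-income ratio is above 1 and the high-income ratio is below 1, which is the pattern the figure describes.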
When choosing where to live, high-income groups can choose areas relatively far from industrial and commercial areas and closer to the suburbs, which have better outdoor air quality. Indoor air quality can be improved by repairing building ventilation systems, purchasing air purification devices, etc.
We collect census race data for San Mateo County from decennial data (2010-2020) at the block level and divide races into six categories: American Indian and Alaska Native alone, Asian alone, Black or African American alone, Native Hawaiian and Other Pacific Islander alone, Two or more races, and White alone.
From the following equity analysis figure we can see that PM2.5 exposure is more obviously unequal across races than across income groups. White people are less exposed to bad air quality (in terms of PM2.5) than they ‘should’ be (based on their share of the population).
At the same time, we found that the distribution of Asian residents is unusual. It does not follow a normal distribution, nor does it show a single trend in one direction. The numbers of Asian residents living in areas with the highest and lowest PM2.5 levels are relatively similar. This may be because some tend to live downtown while others tend to live in the suburbs.
The population equity analysis above assumes that our PM2.5 data is collected evenly across different groups. In reality this is unlikely. For example, suppliers may be unwilling to install sensors in places with relatively small populations or lower economic levels, because doing so would not produce much economic benefit. We still need to make data collection as equal as possible, since that is the premise of any further analysis.
We therefore design scores for the county that communicate the degree to which information on the air quality of different population groups is disproportionately available because of sensor availability. In this section, we propose a set of score metrics at the block group level that show the neediness of different races, income groups, and areas (some block groups still have no sensor). Two quantitative models for calculating the neediness scores are presented. The main idea is that if a group or area already has more sensors than it ‘should’ have (in terms of race, income group, and area coverage), its neediness score should be low.
First, we need to identify the reasonable coverage of one sensor. Taking each outdoor PurpleAir sensor as the center, we draw a circle with a radius of 1/8 mile (about 200 meters). We assume that air quality within 1/8 mile can be represented by one sensor. The union of these circles is therefore the area covered by all air monitoring points in San Mateo County.
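The 1/8 mile coverage rule can be sketched as a point-in-radius check. This is a simplified illustration, not the buffer computation used for the report: it tests whether a single location is within range of any sensor using the haversine great-circle distance, and the sensor coordinates and the function names `haversine_m` and `is_covered` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0          # mean Earth radius, in meters
COVERAGE_RADIUS_M = 1609.344 / 8    # 1/8 mile, about 201 meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_covered(point, sensors, radius_m=COVERAGE_RADIUS_M):
    """True if the point lies within radius_m of at least one sensor."""
    return any(haversine_m(point[0], point[1], s[0], s[1]) <= radius_m for s in sensors)


# Hypothetical sensor location; not taken from the PurpleAir data.
sensors = [(37.4530, -122.1817)]
near = is_covered((37.4540, -122.1817), sensors)  # about 111 m north of the sensor
far = is_covered((37.4560, -122.1817), sensors)   # about 334 m north of the sensor
```

Computing the covered share of a block group's area, as the report does, would additionally require polygon geometry, which this sketch omits.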
Then we look at all census block groups in San Mateo County to study each block group’s demand for additional monitoring sensors, and design scoring rules to assign scores. The higher the score, the more vulnerable the area is and the more monitoring sensors are needed.
We want to collect data equally across races. For example, assume the population of a certain race in the county is \(p\) and the population of this race within the monitored area is \(p_s\). Ideally, the percent with data \(p_w=p_s/p\) should be the same across races, but as mentioned above this is unlikely in practice. To balance them, we assign a low score to races with high \(p_w\) and a high score to races with low \(p_w\). The same principle applies to collecting data equally across income groups and across areas.
For races, income groups, and covered areas, we obtain the following percent-with-data tables in this step.
| race | pop_withdata | pop | perc_withdata |
|---|---|---|---|
| American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 |
| Asian alone | 27173.845 | 230242 | 0.1180230 |
| Black or African American alone | 1534.139 | 15707 | 0.0976723 |
| Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 |
| Some Other Race alone | 8778.023 | 107924 | 0.0813352 |
| Two or more races | 13181.564 | 94267 | 0.1398322 |
| White alone | 55477.500 | 300188 | 0.1848092 |

| cbg | perc_area |
|---|---|
| 060816001001 | 0.0290757 |
| 060816001002 | 0.2605928 |
| 060816001003 | 0.8335262 |
| 060816003002 | 0.0413046 |
| 060816004011 | 0.1475284 |

| income | pop_withdata | pop | perc_withdata |
|---|---|---|---|
| $100,000 or more | 30716.023 | 154403 | 0.1989341 |
| $25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 |
| $45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 |
| Less than $24,999 | 4080.153 | 23999 | 0.1700135 |
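The `perc_withdata` column in the tables above equals `pop_withdata` divided by `pop`. The short sketch below reproduces it for two rows of the race table; the function name `percent_with_data` is hypothetical, and the numbers are copied from the table.

```python
def percent_with_data(pop_with_data, pop):
    """Share of a group's population that lives inside the monitored area."""
    return pop_with_data / pop


# (pop_withdata, pop) pairs copied from the race table.
race_rows = {
    "American Indian and Alaska Native alone": (581.191, 6812),
    "White alone": (55477.500, 300188),
}
race_perc = {race: percent_with_data(ps, p) for race, (ps, p) in race_rows.items()}
```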
Based on the principle, we have two quantitative methods for assigning the neediness scores.
See this dashboard for the result: https://yaojinghuanghe.shinyapps.io/dashboard_data_equity_score_perc/
The first method, the percent method, calculates the score by comparing the percent with data across the members of each category. For example, for race coverage, we compare the different races. Specifically, we rescale the percent with data to the range [0, 1] (min-max normalization) to use as the score. The score should also be low (like a penalty) for a high percent with data, so we use the following equation to assign scores.
\[score=1-\frac{p_w-\min(p_w)}{\max(p_w)-\min(p_w)}\]
The following tables show the scores we get with this method.
| race | pop_withdata | pop | perc_withdata | score |
|---|---|---|---|---|
| White alone | 55477.500 | 300188 | 0.1848092 | 0.0000000 |
| Two or more races | 13181.564 | 94267 | 0.1398322 | 0.4346694 |
| Asian alone | 27173.845 | 230242 | 0.1180230 | 0.6454398 |
| Black or African American alone | 1534.139 | 15707 | 0.0976723 | 0.8421137 |
| Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 | 0.9199578 |
| American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 | 0.9615026 |
| Some Other Race alone | 8778.023 | 107924 | 0.0813352 | 1.0000000 |

| income | pop_withdata | pop | perc_withdata | score |
|---|---|---|---|---|
| $100,000 or more | 30716.023 | 154403 | 0.1989341 | 0.0000000 |
| $25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 | 0.9043295 |
| $45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 | 0.9517088 |
| Less than $24,999 | 4080.153 | 23999 | 0.1700135 | 1.0000000 |

| cbg | perc_area | score |
|---|---|---|
| 060816001001 | 0.0290757 | 0.9661246 |
| 060816001002 | 0.2605928 | 0.6963898 |
| 060816001003 | 0.8335262 | 0.0288795 |
| 060816003002 | 0.0413046 | 0.9518771 |
| 060816004011 | 0.1475284 | 0.8281183 |
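The min-max scores in the race table above can be reproduced with a few lines. This is a minimal sketch of the percent method, assuming the percent-with-data values from the table as input; the function name `minmax_neediness` is hypothetical.

```python
def minmax_neediness(perc_with_data):
    """Score = 1 - (p_w - min) / (max - min), so the best-covered group scores 0.

    perc_with_data: {group: p_w}. Assumes at least two distinct values,
    otherwise the denominator would be zero.
    """
    lo = min(perc_with_data.values())
    hi = max(perc_with_data.values())
    return {g: 1 - (v - lo) / (hi - lo) for g, v in perc_with_data.items()}


# Percent-with-data values copied from the race table.
race_perc_with_data = {
    "American Indian and Alaska Native alone": 0.0853187,
    "Asian alone": 0.1180230,
    "Black or African American alone": 0.0976723,
    "Native Hawaiian and Other Pacific Islander alone": 0.0896175,
    "Some Other Race alone": 0.0813352,
    "Two or more races": 0.1398322,
    "White alone": 0.1848092,
}
race_minmax_scores = minmax_neediness(race_perc_with_data)
```

The results agree with the tabulated scores up to the rounding of the inputs.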
After we get the score for each factor, we can calculate the final neediness scores for each block group. For example, to calculate the race score for a specific block group, we multiply the score of each race by the population of that race in the block group, sum the products, and divide the sum by the total population of the block group (a weighted average). Finally, we get the following map, with three scores for each block group.
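The population-weighted average for one block group can be sketched as follows. The group scores are taken from the percent-method race table, while the block group populations and the function name `block_group_score` are hypothetical.

```python
def block_group_score(group_scores, group_pops):
    """Population-weighted average of group scores within one block group.

    group_scores: {group: score}; group_pops: {group: population in the block group}.
    Returns None for an unpopulated block group.
    """
    total = sum(group_pops.values())
    if total == 0:
        return None
    return sum(group_scores[g] * n for g, n in group_pops.items()) / total


# Scores from the percent-method race table; populations are made up.
scores = {"White alone": 0.0, "Asian alone": 0.6454398}
pops = {"White alone": 600, "Asian alone": 400}
bg_race_score = block_group_score(scores, pops)  # 0.4 * 0.6454398
```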
We can select the block groups that most need more sensors based on different scores. For example, if we only want to make data collection more equal across races, we can select the places with high scores in the Race Score layer, such as some block groups near East Palo Alto. Alternatively, one can weight these scores according to one’s concerns to get a new score.
See this dashboard for the result: https://yaojinghuanghe.shinyapps.io/dashboard_data_equity_score/
The second method is a rank method with exponential decay (as follows). It takes into account the diminishing utility of additional sensors as the number of sensors increases.
\[score=e^{-\lambda (Rank(p_w)-1)}\]
Taking the race score as an example, we give the highest score (1) to the race whose \(p_w\) is the minimum (in our case, Some Other Race alone), and the rest decrease exponentially, with the race whose \(p_w\) is the maximum (in our case, White alone) receiving roughly half of the top score. The tabulated scores are consistent with \(\lambda=\ln 2/n\), where \(n\) is the number of groups being ranked and ranks start at 1.
Next, we assign scores according to the area covered by air quality sensors. Since the coverage rate of many block groups is as high as 100%, we treat them as tied in rank. The higher the coverage, the lower the score, reflecting that these areas already have enough coverage.
As with race, we investigated whether the distribution of air quality sensors differs by income. We found that sensors were least distributed among the lowest-income group and most distributed among the highest-income group, although the three lower income groups differ only slightly. We then scored sensor coverage for each income group and calculated a weighted average for each census block group.
The following tables show the scores we get with this method.
| race | pop_withdata | pop | perc_withdata | rank | score |
|---|---|---|---|---|---|
| Some Other Race alone | 8778.023 | 107924 | 0.0813352 | 1 | 1.0000000 |
| American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 | 2 | 0.9057237 |
| Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 | 3 | 0.8203354 |
| Black or African American alone | 1534.139 | 15707 | 0.0976723 | 4 | 0.7429971 |
| Asian alone | 27173.845 | 230242 | 0.1180230 | 5 | 0.6729501 |
| Two or more races | 13181.564 | 94267 | 0.1398322 | 6 | 0.6095068 |
| White alone | 55477.500 | 300188 | 0.1848092 | 7 | 0.5520448 |

| income | pop_withdata | pop | perc_withdata | rank | score |
|---|---|---|---|---|---|
| Less than $24,999 | 4080.153 | 23999 | 0.1700135 | 1 | 1.0000000 |
| $45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 | 2 | 0.8408964 |
| $25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 | 3 | 0.7071068 |
| $100,000 or more | 30716.023 | 154403 | 0.1989341 | 4 | 0.5946036 |

| cbg | score_cover_area |
|---|---|
| 060816001001 | 0.8184677 |
| 060816001002 | 0.6284733 |
| 060816001003 | 0.5017759 |
| 060816003002 | 0.8083739 |
| 060816004011 | 0.7039799 |
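The rank-based scores in the race and income tables above can be reproduced with the sketch below. It rests on an inference rather than on anything stated in the report: the tabulated values match a decay rate of ln 2 divided by the number of groups, applied to ranks counted from zero so that the least-covered group scores 1. The function name `rank_decay_scores` is hypothetical, and ties are not handled.

```python
import math


def rank_decay_scores(perc_with_data):
    """Exponentially decaying score by rank of percent with data (lowest p_w first).

    perc_with_data: {group: p_w}. The decay rate ln(2) / n is an assumption
    inferred from the tabulated scores; it is not given in the report.
    """
    n = len(perc_with_data)
    lam = math.log(2) / n
    ordered = sorted(perc_with_data, key=perc_with_data.get)  # ascending p_w
    # enumerate() starts at 0, so the least-covered group gets exp(0) = 1.
    return {group: math.exp(-lam * i) for i, group in enumerate(ordered)}


# Percent-with-data values copied from the income table.
income_perc_with_data = {
    "$100,000 or more": 0.1989341,
    "$25,000 to $44,999": 0.1727803,
    "$45,000 to $99,999": 0.1714101,
    "Less than $24,999": 0.1700135,
}
income_rank_scores = rank_decay_scores(income_perc_with_data)
```

The covered-area scores are not reproduced here, because tied ranks among block groups would need an additional rule.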
Similarly, we can also produce a score map using this method. Comparing the two, the main results of the two scoring methods are similar. Intuitively, the rank method is more natural for the race score and income score, whereas the percent method seems more sensitive for the covered-area score. This might be because many block groups have no sensor-monitored area (too many have rank = 1, so the next rank might be 40 rather than 2 or 3). These places’ scores should be high, but not too far from those of places with a little sensor coverage.
However, we also find a common flaw in both scoring methodologies: neither takes into account the actual situation of each area. For example, the scores suggest that we should install more sensors in the western part of San Mateo County and in the eastern part near the Dumbarton Bridge. But most of these areas are nature reserves or parks, and it might be helpful to do another assessment, as few people are in these areas year-round. For research and monitoring needs, however, it could still be necessary to install sensors that are more suitable for these areas.
Conducting equity analysis is an interesting and critical topic, and we can always learn something new from it. Geographic equity is relatively intuitive: we may be able to deduce its distribution from geographical knowledge alone (for example, air circulation may be poor in some places, which causes higher PM2.5 than elsewhere). For population equity, we may also have an overall picture based on experience. For example, high income often brings a higher quality of life, such as living in a place with low PM2.5.
We raise these inequalities, but we rarely think about their solutions. One possible approach is to eliminate or reduce data inequalities so that we can better understand the full picture of the problem. Data equity is an essential issue that is often ignored. When assessing which areas need more sensors, in addition to a data-driven approach, we need to consider other practical factors such as land-use zoning, related policies, and installation and maintenance costs.